Unsupervised Two-Way Clustering of Metagenomic Sequences

Authors

Shruthi Prabhakara

Raj Acharya

Abstract

A major challenge facing metagenomics is the development of tools for characterizing the functional and taxonomic content of vast amounts of short metagenome reads. The efficacy of clustering methods depends on the number of reads in the dataset, the read length, and the relative abundances of the source genomes in the microbial community. In this paper, we formulate an unsupervised naive Bayes multispecies, multidimensional mixture model for reads from a metagenome. We use the proposed model to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. We model the distribution of word counts along a genome as a Gaussian for shorter, frequent words and as a Poisson for longer, rare words. We employ either a mixture of Gaussians or a mixture of Poissons to model reads within each bin. Further, we handle the high dimensionality and sparsity of the data by grouping the set of words comprising the reads, which results in a two-way mixture model. Finally, we demonstrate the accuracy and applicability of this method on simulated and real metagenomes. Our method can accurately cluster reads as short as 100 bp and is robust to varying abundances, divergences, and read lengths.
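The read-clustering half of the model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the mixture-of-Poissons case, omits the word-grouping step that makes the model two-way, and omits the Gaussian variant for short words. The function names `kmer_counts` and `poisson_mixture_em`, the farthest-point seeding, the smoothing constants, and the toy reads are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(read, k):
    """Count overlapping length-k words in a read, in lexicographic word order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    seen = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    return [seen[w] for w in vocab]


def poisson_mixture_em(X, n_clusters, n_iter=25, eps=1e-3):
    """Fit a naive Bayes mixture of Poissons to count vectors X with EM."""
    n, d = len(X), len(X[0])
    # Assumed initialisation (not from the paper): farthest-point seeding.
    seeds = [0]
    while len(seeds) < n_clusters:
        def gap(i):
            return min(sum((a - b) ** 2 for a, b in zip(X[i], X[s])) for s in seeds)
        seeds.append(max(range(n), key=gap))
    # Smooth the seed counts so every log(rate) is finite.
    rates = [[x + 0.5 for x in X[s]] for s in seeds]
    weights = [1.0 / n_clusters] * n_clusters
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each read.
        # The log(x!) term is the same for every cluster, so it is dropped.
        resp = []
        for x in X:
            logp = [
                math.log(weights[c])
                + sum(xj * math.log(r) - r for xj, r in zip(x, rates[c]))
                for c in range(n_clusters)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: update mixing weights and per-word Poisson rates.
        for c in range(n_clusters):
            mass = sum(r[c] for r in resp)
            weights[c] = max(mass / n, 1e-12)
            rates[c] = [
                (sum(r[c] * x[j] for r, x in zip(resp, X)) + eps) / (mass + eps)
                for j in range(d)
            ]
    labels = [max(range(n_clusters), key=lambda c: r[c]) for r in resp]
    return labels, weights, rates


# Toy reads (hypothetical): three AT-only and three GC-only sequences.
reads = ["AT" * 10, "AATT" * 5, "ATTA" * 5, "GC" * 10, "GGCC" * 5, "GCCG" * 5]
X = [kmer_counts(r, 2) for r in reads]
labels, weights, rates = poisson_mixture_em(X, n_clusters=2)
print(labels, [round(w, 3) for w in weights])
```

In this sketch the fitted mixing weights act as a rough estimate of the share of reads assigned to each cluster, which is one simple way to read the abundance characterization mentioned in the abstract. Real reads would need a larger word length and far more data than this toy input.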


Similar resources

Erratum to “Unsupervised Two-Way Clustering of Metagenomic Sequences”

and Bahrad Sokhansanj, "Metagenome fragment classification using N-mer frequency profiles," Advances in Bioinformatics, Volume 2008 (2008).


A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on various DNA composition patterns (e.g., the tetramer frequencies) of ...


MC-MinH: Metagenome Clustering using Minwise based Hashing

Current bio-technologies allow sequencing of genomes from multiple organisms, that co-exist as communities within ecological environments. This collective genomic process (called metagenomics) has spurred the development of several computational tools for the quantification of abundance, diversity and role of different species within different communities. Unsupervised clustering algorithms (al...


Comparison Between Unsupervised and Supervise Fuzzy Clustering Method in Interactive Mode to Obtain the Best Result for Extract Subtle Patterns from Seismic Facies Maps

Pattern recognition on seismic data is a useful technique for generating seismic facies maps that capture changes in the geological depositional setting. Seismic facies analysis can be performed using the supervised and unsupervised pattern recognition methods. Each of these methods has its own advantages and disadvantages. In this paper, we compared and evaluated the capability of two unsuperv...


High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...



Volume 2012

Publication date: 2012
